Abstract

Earlier research has shown how a mesh connected array of size N = 2^K,
K an integer, can be augmented by adding at most one edge per node such
that it can perform a shuffle-exchange of size N/2 in constant time.

We now show how to perform a shuffle-exchange of size N on this
augmented array in constant time. This is done by combining the
available perfect shuffle of size N/2 with the existing nearest neighbor
connections of the mesh. By carefully scheduling the different
permutations that are composed in order to achieve the shuffle, the time
required is reduced to 5 steps, which is shown to be optimal for this
network.


This research was supported by NASA Contract NAS1-17070 while the
author was in residence at the Institute for Computer Applications in
Science and Engineering (ICASE), NASA Langley Research Center.
1. Introduction

It has been shown [1] that a mesh connected computer of size N = 2^K,
K an integer, can be augmented by adding at most one edge per node so
that it can carry out the shuffle-exchange permutation [2] of N/2
elements in constant time. On a 4-nearest neighbor mesh, for example,
this requires 3 time steps.

The motivation for this work was to construct new interconnection
structures that combine the capabilities of shuffle-exchange networks
and nearest neighbor arrays but with cost less than the sum of the costs
of the constituent networks. It is clearly trivial to superimpose two
networks to get the sum of their capabilities. The augmentation
described in [1] permits a mesh of size N to perform a shuffle-exchange
of size N/2 at the cost of at most one additional edge per node. A
question that was unresolved in [1] is how efficiently the augmented
network could carry out a shuffle-exchange of size N.

In the present paper we demonstrate that this augmented network can 
be used to perform the shuffle-exchange of N points in constant time. 
Furthermore, we show that, by carefully scheduling the data transfers 
between different parts of the network, this can be done in 5 time 
steps, which is optimal for this network. 

2. The Augmented Mesh 

A shuffle-exchange network of size 8 is shown in Fig. 1. Fig. 2 
shows how a 4-nearest neighbor mesh of size 16 may be augmented by 
adding the required shuffle connections. It is clear that data at nodes 
0, 1, ..., 7 can be shuffled over to nodes 0', 1', ..., 7'. The exchange
operation can then be performed using the vertical mesh connections
between pairs of nodes 0'-1', 2'-3', etc., and the results moved back to
the original set of nodes 0, 1, ..., 7 via the horizontal mesh
connections. For example, data in nodes 1 and 5 would, after shuffling,
end up in nodes 2' and 3', respectively. The exchange would be performed
using the vertical edge joining 2' and 3'. Finally, data from 2' and 3'
would be shipped back to 2 and 3 via horizontal edges. The entire
shuffle-exchange operation of size N/2 requires 3 time steps.
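
The three steps just described can be traced with a short simulation.
This is only a sketch under stated assumptions: the shuffle is the
standard perfect shuffle of [2], the exchange is taken to be an
unconditional swap of each primed pair, and the function names
(shuffle, shuffle_exchange_half) are hypothetical, not from the paper.

```python
def shuffle(i, m):
    """Perfect shuffle of size m: the datum at position i moves to P(i)."""
    return 2 * i if i < m // 2 else 2 * i - m + 1


def shuffle_exchange_half(data):
    """Three-step shuffle-exchange on the unprimed nodes 0..m-1."""
    m = len(data)
    primed = [None] * m
    # Step 1: shuffle edges carry the datum at node i to primed node P(i)'.
    for i, x in enumerate(data):
        primed[shuffle(i, m)] = x
    # Step 2: vertical mesh edges swap the primed pairs (0',1'), (2',3'), ...
    # (assumed unconditional here).
    for k in range(0, m, 2):
        primed[k], primed[k + 1] = primed[k + 1], primed[k]
    # Step 3: horizontal mesh edges return primed node j' to unprimed node j.
    return list(primed)


result = shuffle_exchange_half(list(range(8)))
print(result)
```

With the data initially equal to the node labels, the datum from node 1
passes through 2' and ends at node 3, and the datum from node 5 passes
through 3' and ends at node 2, matching the example in the text.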

The augmentation procedure is to give unprimed labels 0 to (N/2 - 1)
to the odd columns and primed labels 0' to (N/2 - 1)' to the even
columns. Shuffle connections are then added between these nodes. It is
important to note that this augmentation can be applied to any array as
long as it is a "true" rectangle (each side has at least 2 nodes along
it) and the number of nodes is a power of 2. Such an array can be
divided into N/4 squares of size 4 each (e.g. 0, 0', 1, 1' in Fig. 2)
which can carry out the exchange operation.

3. Complete Shuffles on an Augmented Mesh

We now show how a shuffle of size N can be performed on a mesh that 
has been augmented in the manner described in the previous section. As 
a running example, we use a rectangular mesh of size 32. 

The shuffle-exchange operation in the previous section utilizes all 
the added shuffle edges plus some of the original mesh edges. In Fig. 3 
we have deleted from our augmented rectangular mesh all connections 
unnecessary for the shuffle-exchange operation. The layout of the 
original mesh is not important to our analysis. It could be configured
as 2x16, 4x8, 8x4 or 16x2 nodes. In all cases we would add shuffle
connections as described in the previous section. Our problem now is to
show how a shuffle-exchange operation can be performed on the 32 data
items stored in nodes 0..31 of Fig. 3.

Recall that the shuffle operation is defined [2] as follows: 

$P(i) = 2i, \quad 0 \le i \le N/2 - 1$      (1)
$P(i) = 2i - N + 1, \quad N/2 \le i \le N - 1$      (2)
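
As a quick check of the definition, the following sketch evaluates P
for N = 8 and compares it with the equivalent description of the
perfect shuffle as a cyclic left shift of the K-bit binary form of i.
The function names are hypothetical and chosen only for illustration.

```python
def shuffle(i, n):
    """Perfect shuffle P(i) of size n (n a power of 2), per (1) and (2)."""
    if not 0 <= i < n:
        raise ValueError("index out of range")
    return 2 * i if i < n // 2 else 2 * i - n + 1


def rotate_left(i, n):
    """Cyclic left shift of the K-bit binary form of i, where n = 2**K."""
    k = n.bit_length() - 1
    return ((i << 1) | (i >> (k - 1))) & (n - 1)


n = 8
image = [shuffle(i, n) for i in range(n)]
print(image)  # destinations of nodes 0..7
```

Every destination appears exactly once, so P is a permutation of
0..N-1, and nodes 1 and 5 map to 2 and 3 as in the example of Section 2.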

It is easy to trace through the network of Fig. 3 and see that a 
path of length no greater than 4 exists between any node i and its P(i), 
as defined by (1) and (2) above. This does not necessarily imply, how- 
ever, that the shuffle can be performed in four steps because for any 
nodes i and j, the paths i to P(i) and j to P(j) can have several common 
edges, leading to delays. 

Inspection of the edges in Fig. 3 reveals that they can be divided 
into four classes: the horizontal and vertical edges of the original 
network and the shuffle connections from left to top right and from left 
to bottom right. 

In the following, we specify routings between all i and their 
corresponding P(i) such that the shuffle operation can be performed in 
constant time, independent of the size of the array. In the next section 
we show how these routings can be scheduled so that the shuffle is done 
in optimal time. 
The edges of the augmented network allow us to perform the following
permutations.

1) Vertical edges:

$V(i) = i + 1, \quad 0 \le i \le N-1$, even i      (3)

$V^{-1}(i) = i - 1, \quad 0 \le i \le N-1$, odd i      (4)

2) Horizontal edges:

$H(i) = i + N/2, \quad 0 \le i \le N/2 - 1$      (5)

$H^{-1}(i) = i - N/2, \quad N/2 \le i \le N-1$      (6)

3) Shuffle edges from left towards top right:

$T(i) = 2i + N/2, \quad 0 \le i \le N/4 - 1$      (7)

$T^{-1}(i) = i/2 - N/4, \quad N/2 \le i \le N-1$, even i      (8)

4) Shuffle edges from left towards bottom right:

$B(i) = 2i + 1, \quad N/4 \le i \le N/2 - 1$      (9)

$B^{-1}(i) = (i-1)/2, \quad N/2 \le i \le N-1$, odd i      (10)
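
The eight maps can be written down directly. The sketch below is one
possible encoding, not the author's: the dictionary keys, the helper
name edge_permutations and the domain checks are assumptions introduced
for illustration, with domains taken from the ranges stated in (3)-(10).

```python
def edge_permutations(n):
    """Return the maps (3)-(10) for an augmented mesh of n = 2**K nodes."""
    half, quarter = n // 2, n // 4

    def checked(name, rule, domain):
        # Wrap a rule so that it refuses nodes outside its stated range.
        def apply(i):
            if i not in domain:
                raise ValueError(f"{name} is not defined at node {i}")
            return rule(i)
        return apply

    return {
        "V": checked("V", lambda i: i + 1, range(0, n, 2)),
        "V_inv": checked("V_inv", lambda i: i - 1, range(1, n, 2)),
        "H": checked("H", lambda i: i + half, range(0, half)),
        "H_inv": checked("H_inv", lambda i: i - half, range(half, n)),
        "T": checked("T", lambda i: 2 * i + half, range(0, quarter)),
        "T_inv": checked("T_inv", lambda i: i // 2 - quarter,
                         range(half, n, 2)),
        "B": checked("B", lambda i: 2 * i + 1, range(quarter, half)),
        "B_inv": checked("B_inv", lambda i: (i - 1) // 2,
                         range(half + 1, n, 2)),
    }


p = edge_permutations(32)
print(p["T"](7), p["B"](8))
```

For N = 32 this sends node 7 to node 30 along a T edge and node 8 to
node 17 along a B edge, and each inverse map undoes its forward map.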


We will be applying the permutations (3)-(10) above on groups of
nodes. For notational convenience, we replace "(i)" in (3)-(10) with
the triple "[begin, end, step]". In this vector notation,
V^{-1}[1, N/2 - 1, 2] means that permutation V^{-1} is applied to every
second node, starting with node 1 and going up to node N/2 - 1.
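
The triple can be read as an inclusive range. A minimal sketch,
assuming hypothetical helper names expand and apply_to_group, applies
T[0, N/4 - 1, 1] for N = 32:

```python
def expand(begin, end, step):
    """Nodes named by the triple [begin, end, step]; `end` is inclusive."""
    return list(range(begin, end + 1, step))


def apply_to_group(perm, begin, end, step):
    """Apply perm to every node of the triple; return {source: target}."""
    return {i: perm(i) for i in expand(begin, end, step)}


N = 32


def T(i):
    """Shuffle edge towards top right, as in (7)."""
    return 2 * i + N // 2


moves = apply_to_group(T, 0, N // 4 - 1, 1)
print(moves)
```

Nodes 0 through 7 are sent to the even nodes 16 through 30.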
The following are the permutations that must be composed in order
to obtain the perfect shuffle. (We use the left composition convention
in this discussion; that is, the rightmost permutation is applied
first.)

$H^{-1}[N/2, N-2, 2] \circ T[0, N/4-1, 1]$      (11)
$V^{-1}[N/2+1, N-1, 2] \circ B[N/4, N/2-1, 1]$      (12)
$H^{-1}[N/2+3, N-1, 4] \circ V[N/2+2, N-2, 4] \circ T[1, N/4-1, 2] \circ H^{-1}[N/2+1, 3N/4-1, 2]$      (13)
$V[0, N/2-4, 4] \circ H^{-1}[N/2, N-4, 4] \circ T[0, N/4-2, 2] \circ H^{-1}[N/2, 3N/4-2, 2]$      (14)
$B[N/4, N/2-1, 1] \circ H^{-1}[3N/4, N-1, 1]$      (15)

It is easy to verify that (11) to (15) correspond to (1) and (2) 
over the correct ranges. 
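
The verification can also be mechanized. The sketch below encodes
(11)-(15) as lists of stages in order of application (rightmost factor
first), checks at every stage that the data sit on exactly the nodes
named by the triple, and compares the result with (1) and (2). The
encoding and the names compositions, run and verify are assumptions
made for this example.

```python
def shuffle(i, n):
    """Perfect shuffle P(i) of size n, per (1) and (2)."""
    return 2 * i if i < n // 2 else 2 * i - n + 1


def compositions(n):
    """Compositions (11)-(15) as stage lists, first-applied stage first."""
    h, q = n // 2, n // 4
    V = lambda i: i + 1        # (3)
    Vi = lambda i: i - 1       # (4)
    Hi = lambda i: i - h       # (6)
    T = lambda i: 2 * i + h    # (7)
    B = lambda i: 2 * i + 1    # (9)
    return {
        11: [(T, 0, q - 1, 1), (Hi, h, n - 2, 2)],
        12: [(B, q, h - 1, 1), (Vi, h + 1, n - 1, 2)],
        13: [(Hi, h + 1, 3 * q - 1, 2), (T, 1, q - 1, 2),
             (V, h + 2, n - 2, 4), (Hi, h + 3, n - 1, 4)],
        14: [(Hi, h, 3 * q - 2, 2), (T, 0, q - 2, 2),
             (Hi, h, n - 4, 4), (V, 0, h - 4, 4)],
        15: [(Hi, 3 * q, n - 1, 1), (B, q, h - 1, 1)],
    }


def run(stages):
    """Apply the stages in order and return {source node: final node}."""
    _, begin, end, step = stages[0]
    sources = list(range(begin, end + 1, step))
    positions = list(sources)
    for perm, begin, end, step in stages:
        # The data must sit exactly on the nodes named by this triple.
        assert positions == list(range(begin, end + 1, step))
        positions = [perm(p) for p in positions]
    return dict(zip(sources, positions))


def verify(n):
    """True if (11)-(15) together realize the perfect shuffle of size n."""
    routed = {}
    for stages in compositions(n).values():
        routed.update(run(stages))
    return routed == {i: shuffle(i, n) for i in range(n)}


print(verify(32))
print(sum(len(s) for s in compositions(32).values()))  # sequential steps
```

The second printed value is the number of steps needed when the five
compositions are applied one after another.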

To avoid edge conflicts, we can successively apply (11) to (15) to
the network. The perfect shuffle permutation can thus be achieved in 14
time steps (the sum of the times required for (11)-(15)). The exchange
permutation involves nothing more than the interchange of data via the
vertical mesh links, i.e. the application of V and V^{-1} to all even
and odd nodes respectively. This can be done in one step, resulting in a
total of 15 time steps for the shuffle-exchange.

4. Optimal Scheduling

In this section we describe how we can optimally perform the perfect
shuffle. We view this as a scheduling problem. The jobs are data
0 through N-1. Permutations (3)-(10) are the available processors which
must be applied to these jobs according to the sequences (11)-(15).
Each permutation requires one time step. At any time step a datum may
have only one permutation applied to it. Each permutation (strictly
speaking, each edge) may be used only once during each time step.

By inserting idle times (null permutations) very carefully, the 
schedule of Table I is obtained. In order to be consistent with the 
left composition convention, time advances from right to left in this 
table. It may be verified that this schedule satisfies all of the above 
constraints. 

The longest compositions, (13) and (14), are of length 4. Thus our 
schedule, also being of length 4, is optimal. A further step is 
required for the exchange, giving a total of 5 steps. 

Table II gives the instance of Table I for N=32 (corresponding to
Fig. 3). The extra leftmost column in this table shows the range of the
last permutation in each row, demonstrating that all points are
included.

5. Conclusions

We have shown that an augmented mesh of size N can perform the
shuffle-exchange in constant time and have also shown how this can be
done optimally in 5 time steps. This result indicates that the shuffle
augmented mesh of [1] is a powerful interconnection structure that
combines the advantages of nearest neighbor arrays and shuffle-exchange
networks.



6. Acknowledgement

The author is grateful to Dr. R. G. Voigt for several discussions 
and for a critical reading of the manuscript. 

7. References

[1] S. H. Bokhari and A. D. Raza, "Augmenting Computer Networks,"
Proc. 1984 Int. Conf. on Parallel Processing, August 1984, pp. 338-345.

[2] H. S. Stone, "Parallel Processing with the Perfect Shuffle," IEEE
Trans. Computers, vol. C-20, no. 2, pp. 153-161, February 1971.









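Table I. Optimal schedule for the perfect shuffle.
(Time advances from right to left. The last column gives the number of
the composition scheduled in that row.)

Step 4                  Step 3                  Step 2                 Step 1                     Composition
H^{-1}[N/2, N-2, 2]     IDLE                    IDLE                   T[0, N/4-1, 1]             (11)
IDLE                    V^{-1}[N/2+1, N-1, 2]   IDLE                   B[N/4, N/2-1, 1]           (12)
H^{-1}[N/2+3, N-1, 4]   V[N/2+2, N-2, 4]        T[1, N/4-1, 2]         H^{-1}[N/2+1, 3N/4-1, 2]   (13)
V[0, N/2-4, 4]          H^{-1}[N/2, N-4, 4]     T[0, N/4-2, 2]         H^{-1}[N/2, 3N/4-2, 2]     (14)
IDLE                    B[N/4, N/2-1, 1]        H^{-1}[3N/4, N-1, 1]   IDLE                       (15)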


Table II. Optimal schedule for N=32.
(Time advances from right to left. The leftmost column contains the
range of the last permutation in each row.)

Final range                Step 4                           Step 3                           Step 2                           Step 1                    Composition
[0,2,4,6,8,10,12,14]       H^{-1}[16,18,20,22,24,26,28,30]  IDLE                             IDLE                             T[0,1,2,3,4,5,6,7]        (11)
[16,18,20,22,24,26,28,30]  IDLE                             V^{-1}[17,19,21,23,25,27,29,31]  IDLE                             B[8,9,10,11,12,13,14,15]  (12)
[3,7,11,15]                H^{-1}[19,23,27,31]              V[18,22,26,30]                   T[1,3,5,7]                       H^{-1}[17,19,21,23]       (13)
[1,5,9,13]                 V[0,4,8,12]                      H^{-1}[16,20,24,28]              T[0,2,4,6]                       H^{-1}[16,18,20,22]       (14)
[17,19,21,23,25,27,29,31]  IDLE                             B[8,9,10,11,12,13,14,15]         H^{-1}[24,25,26,27,28,29,30,31]  IDLE                      (15)